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CoUinearity and near-coUinearity of predictors cause difficulties 
when doing regression. In tiiese cases, variable selection becomes un- 
tenable because of mathematical issues concerning the existence and 
numerical stability of the regression coefficients, and interpretation 
of the coefficients is ambiguous because gradients are not defined. 
Using a differential geometric interpretation, in which the regression 
coefficients are interpreted as estimates of the exterior derivative of a 
function, we develop a new method to do regression in the presence 
of collinearities. Our regularization scheme can improve estimation 
error, and it can be easily modified to include lasso-type regulariza- 
tion. These estimators also have simple extensions to the "large p, 
small n" context. 

1. Introduction. Variable selection is an important topic because of its 
wide set of applications. Amongst the recent literature, lasso-type regular- 
ization [13, 30, 54, 59] and the Dantzig selector [10] have become popular 
techniques for variable selection. It is known that these particular tools have 
trouble handling collinearity. This has prompted work on extensions [60], 
though further developments are still possible. 

Collinearity is a geometric concept: it is equivalent to having predictors 
which lie on manifolds of dimension lower than the ambient space, and it 
suggests the use of manifold learning to regularize ill-posed regression prob- 
lems. The geometrical intuition has not been fully understood and exploited, 
though several techniques [5, 8, 30, 60] have provided some insight. Though 
it is not strictly necessary to learn the manifold for prediction [8], doing so 
can improve estimation in a min-max sense [43]. 
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This paper considers variable selection and coefficient estimation when 
the predictors lie on a lower-dimensional manifold, and we focus on the case 
where this manifold is nonlinear; the case of a global, linear manifold is a 
simple extension, and we include a brief discussion and numerical results 
for this case. Prediction of function value on a nonlinear manifold was first 
studied in [8] , but the authors did not study estimation of derivatives of the 
function. We do not consider the case of global estimation and variable selec- 
tion on nonlinear manifolds because [21] showed that learning the manifold 
globally is either poorly defined or computationally expensive. 

1.1. Overview. We interpret collinearities in the language of manifolds, 
and this provides the two messages of this paper. This interpretation allows 
us to develop a new method to do regression in the presence of collinearities 
or near-collinearities. This insight also allows us to provide a novel interpre- 
tation of regression coefficients when there is significant collinearity of the 
predictors. 

On a statistical level, our idea is to learn the manifold formed by the 
predictors and then use this to regularize the regression problem. This form 
of regularization is informed by the ideas of manifold geometry and the 
exterior derivative [34, 40]. Our idea is to learn the manifold either locally 
(in the case of a local, nonlinear manifold) or globally (in the case of a 
global, linear manifold). The regression estimator is posed as a least-squares 
problem with an additional term which penalizes for the regression vector 
lying in directions perpendicular to the manifold. 

Our manifold interpretation provides a new interpretation of the regres- 
sion coefficients. The gradient describes how the function changes as each 
predictor is changed independently of other predictors. This is impossible 
to do when there is collinearity of the predictors, and the gradient does 
not exist [51]. The exterior derivative of a function [34, 40] tells us how the 
function value changes as a predictor and its coUinear terms are simultane- 
ously changed, and it has applications in control engineering [47], physics 
[40] and mathematics [51]. In particular, most of our current work is in 
high-dimensional system identification for biological and control engineer- 
ing systems [3, 4]. We interpret the regression coefficients in the presence of 
collinearities as the exterior derivative of the function. 

The exterior derivative interpretation is useful because it says that the re- 
gression coefficients only give derivative information in the directions parallel 
to the manifold, and the regression coefficients do not give any derivative 
information in the directions perpendicular to the manifold. If we restrict 
ourselves to computing regression coefficients for only the directions paral- 
lel to the manifold, then the regression coefficients are unique and they are 
uniquely given by the exterior derivative. 
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This is not entirely a new interpretation. Similar geometric interpretations 
are found in the literature [16, 18, 19, 30, 37, 58], but our interpretation is 
novel because of two main reasons. The first is that it is the first time the 
geometry is interpreted in the manifold context, and this is important for 
many application domains. The other reason is that this interpretation al- 
lows us to show that existing regularization techniques are really estimates 
of the exterior derivative, and this has important implications for the inter- 
pretation of estimates calculated by existing techniques. We do not explicitly 
show this relationship; rather, we establish a link from our estimator to both 
principal components regression (PGR) [16, 37] and ridge regression (RR) 
[26, 30]. Links between PGR, RR and other regularization techniques can 
be shown [18, 22, 24, 25]. 

1.2. Previous work. Past techniques have recognized the importance of 
geometric structure in doing regression. Ordinary least squares (OLS) per- 
forms poorly in the presence of collinearities, and this prompted the develop- 
ment of regularization schemes. RR [26, 30] provides proportional shrinkage 
of the OLS estimator, and elastic net (EN) [60] combines RR with lasso-type 
regularization. The Moore-Penrose pseudoinverse (MP) [30] explicitly con- 
siders the manifold. MP works well in the case of a singular design matrix, 
but it is known to be highly discontinuous in the presence of near-collinearity 
caused by er r or s-in- variables. PGR [16, 37] and partial least squares (PLS) 
[2, 16, 55] are popular approaches which explicitly consider geometric struc- 
ture. 

The existing techniques are for the case of a global, linear manifold, but 
these techniques can easily be extended to the case of local, nonlinear mani- 
folds. The problem can be posed as a weighted, linear regression problem in 
which the weights are chosen to localize the problem [46] . Variable selection 
in this context was studied by RODEO [32] , but this tool requires a heuristic 
form of regularization which does not explicitly consider collinearity. 

Sparse estimates can simultaneously provide variable selection and im- 
proved estimates, but producing sparse estimates is difficult when the pre- 
dictors lie on a manifold. Lasso-type regularization, the Dantzig selector 
and the RODEO cannot deal with such situations. The EN produces sparse 
estimates, but it does not explicitly consider the manifold structure of the 
problem. One aim of this paper is to provide estimators that can provide 
sparse estimates when the regression coefficients are sparse in the original 
space and the predictors lie on a manifold. 

If the coefficients are sparse in a rotated space, then our estimators ad- 
mit extensions which consider rotations of the predictors as another set of 
tunable parameters which can be chosen with cross-validation. In variable 
selection applications, interpretation of selected variables is difficult when 
dealing with rotated spaces, and so we only focus on sparsity in the original 
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space. Numerical results show that our sparse estimators without additional 
rotation parameters do not seem to significantly worsen estimation when 
there is no sparsity in the unrotated space. 

The estimators we develop learn the manifold and then use this to regular- 
ize the regression problem. As part of the manifold learning, it is important 
to estimate the dimension of the manifold. This can either be done with 
dimensionality estimators [12, 23, 35] or with resampling-based approaches. 
Though it is known that cross-validation performs poorly when used with 
PGR [31, 41], we provide numerical examples in Section 7 which show that 
bootstrapping, to choose dimension, works well with our estimators. Also, 
it is worth noting that our estimators only work for manifolds with integer 
dimensions, and our approach cannot deal with fractional dimensions. 

Learning the manifold differs in the case of the local, nonlinear manifold 
as opposed to the case of the global, linear manifold. In the local case, we 
use kernels to localize the estimators which (a) learn the manifold and (b) 
do the nonparametric regression. For simplicity, we use the same bandwidth 
for both, but we could also use separate band widths. In contrast, the linear 
case has faster asymptotic convergence because it does not need localization. 
We consider a linear case with errors-in-variables where the noise variance 
is identifiable [28, 31], and this distinguishes our setup from that of other 
linear regression setups [13, 30, 54, 59, 60]. 

2. Problem setup. We are interested in prediction and coefficient esti- 
mation of a function which lies on a local, nonlinear manifold. In the basic 
setup, we are only concerned with local regression. Consequently, in order to 
prove results on the pointwise-convergence of our estimators, we only need 
to make assumptions which which hold locally. The number of predictors is 
kept fixed. Note that it is possible that the dimension of the manifold varies 
at different points in the predictor space; we do not prohibit such behavior. 
We cannot do estimation at the points where the manifold is discontinuous, 
but we can do estimation at the remaining points. 

Suppose that we would like to estimate the derivative information of the 
function about the point Xq £ MP, where there are p predictors. The point 
Xq is the choice of the user, and varying this point allows us to compute the 
derivative information at different points. Because we do local estimation, 
it is useful to select small portions of the predictor-space; we define a ball 
of radius R centered at X in p-dimensions using the notation: = {v £ 

W:\\v-x\\l<R}.'^ 

We assume that the predictors form a d-dimensional manifold M in a 
small region surrounding Xq, and we have a function which lies on this 



^In our notation, we denote subscripts in lower case. For instance, the ball surrounding 
the point Xo is denoted in subscripts with the lower case xo- 
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Fig. 1. In a small ball B%^^^ about the point Xq, the predictors form the manifold M. 
The response variable is a function of the predictors, which lie on this manifold. Here the 
manifold is of dimension d = 1, and the number of predictors is p = 2. 

manifold /(•) : — t- M. Note that d<p, and that d is in general a function 
of Xq] however, implicit in our assumptions is that the manifold M is con- 
tinuous within the ball. We can more formally define the manifold at point 
Xq as the image of a local chart: 

(2.1) M = {(/.(u) G eg,,,^ cW:ue C M'^}, 

for small /i, r > 0. An example of this setup for p = 2 and d = l can be seen 
in Figure 1. 

We make n measurements of the predictors Xi G M^, for i = {1, . . . ,n}, 
where the Xi are independent and identically distributed. We also make 
n noisy measurements of the function Yi = f{Xi) + ej, where the Ei are 
independent and identically distributed with E(ej) = and Var(ej) = . 
Let K, M > be finite constants, and assume the following: 

1. The kernel function K{-), which is used to localize our estimator by se- 
lecting points within a small region of predictor-space, is three-times dif- 
ferentiable and radially symmetric. These imply that K{-) and K"{-) are 
even functions, while K'{-) is an odd function. 

2. The bandwidth h is the radius of predictor points about Xq which are 
used by our estimator, and it has the following asymptotic rate: h = 

^^-l/(d+4)_ 

3. The kernel K{-) either has exponential tails or a finite support [8]. Math- 
ematically speaking, 

E[K^{{X - x)/h)w{X)l{X G {Bl^.-.n = o(/i'^+^), 
for 7 G {1,2}, < e < 1 and \w{x)\ < M(l + jxp). 
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4. The local chart which is used to define the manifold in (2.1) is in- 
vertible and three-times differentiable within its domain. The manifold 

is a differentiable manifold, and the function /(•) is three-times dif- 
ferentiable: \\didjdk{f o (/>)||oo < M. 

5. The probability density cannot be defined in the ambient space because 
the Lebesgue measure of a manifold is generally zero. We have to de- 
fine the probability density with respect to a d-dimensional measure by 
inducing the density with the map [8]. We define: 

where S C M.P. The density Q(-) is denoted by F(z), and we assume that 
it is three-times differentiable and strictly positive at (z = 0) G (j)^^{XQ). 

6. The Tikhonov-type regularization parameter is nondecreasing and 
satisfies the following rates: Xn/nh'^'^'^ — )■ oo and hXn/nh'^'^'^ — )• 0. The 
lasso-type regularization parameter /x„ is nonincreasing and satisfies the 
following rates: /i„(n/i'^+2)-i/2 _^ q ^^(-^;jd+2^ (7-1)72 _^ 

The choice of the local chart </>(•) is not unique; we could have chosen a 
different local chart Fortunately, it can be shown that our results are 
invariant under the change of coordinates o (/> as long as the measure 
Q(-) is defined to follow suitable compatibility conditions under arbitrary, 
smooth coordinate changes. This is important because it tells us that our 
results are based on the underlying geometry of the problem. 



3. Change in rank of local covariance estimates. To localize the regres- 
sion problem, we use kernels, bandwidth matrices and weight matrices. We 
define the scaled kernel Kh{U) = h~^K{U/h), where h is a bandwidth. The 
weight matrix centered at Xq with bandwidth h is given by 

W^, = dmg{Kh{Xi - Xo), . . . , Kh{X^ - Xq)), 

and the augmented bandwidth matrix is given by H = H^I'^H^I^, where 





hi 



pxp 



If we define the augmented data matrix as 

-1 {Xi 
-^xo ~ : : 

.1 (Xn-Xo) 

then the weighted Gram matrix of X^a is 



(3.1) 



rill 

i21 



r<i2 

22 
n 



W>-H~^I^X',^W,,X.,,H~^I\ 
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A formal statement is given in the Appendix, but the weighted Gram matrix 
(3.1) converges in probabihty to the following population parameters: 



C^^ = F{0) [ K{du(p-u)du, 



(3.2) 



C2i = Ci2' = 0, 
= F{0)d^^ ■ 



K(du(p{0) • u)uu' du 



■ do, 



If we expand the C*^^ term from the weighted Gram matrix (3.1) into 

(3.3) = ^^^^^X;K;,(X, -Xo)(X, - Xo)iX, - XoY, 

1=1 

it becomes apparent that can be interpreted as a local covariance matrix. 
The localizing terms associated with the kernel add a bias, and this causes 
problems when doing regression. The C^^ term does not cause problems 
because it is akin to the denominator of the Nadaraya-Watson estimator 
[15] which does not need any regularizations. 

The bias of the local covariance estimate (7^^ causes problems in doing 
regression, because the bias can cause the rank of (7^^ to be different than 
the rank of C^^. The change in rank found in the general case of the local, 
nonlinear manifold causes problems with MP which is discontinuous when 
the covariance matrix changes rank [1]. In the special case of a global, linear 
manifold, a similar change in rank can happen because of errors-in- variables. 
It is worth noting that MP works well in the case of a singular design matrix. 



4. Manifold regularization. To compensate for this change in rank, we 
use a Tikhonov-type regularization similar to RR and EN. The distinguish- 
ing feature of our estimators is the particular form of the regularizing matrix 
used. Our approach is to estimate the tangent plane at Xq of the manifold 
M and then constrain the regression coefficients to lie close to the prin- 
cipal components of the manifold. The idea for this type of regularization 
is informed by the intuition of exterior derivatives.^ An advantage of this 
regularization is that it makes it easy to apply lasso- type regularizations, 
and this combination of the two types of regularization is similar to EN. 

To constrain the regression coefficients to lie close to the manifold, we 
pose the problem as a weighted least-squares problem with Tikhonov-type 
regularization: 

(4.1) /3 = argmin||M^(y - X(3)\\l + X\m\l 



■^We specifically use the intuition that the exterior derivative lies in the cotangent space 
of the manifold, and this statement can be mathematically written as: dxf £ T'M. 
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The matrix 11 is a projection matrix chosen to penahze /3 for lying off of the 
manifold. Contrast this to RR and EN which choose 11 to be the identity 
matrix. Thus, RR and EN do not fully take the manifold structure of the 
problem into consideration. 

Stated in another way, 11 is a projection matrix which is chosen to penalize 
the components of /3 which are perpendicular to the manifold. The cost 
function we are minimizing has the term ||n/3||2, and this term is large if /3 
has components perpendicular to the manifold. Components of /3 parallel to 
the manifold are not penalized because the projection onto these directions 
is zero. 

Since we do not know the manifold a priori, we must learn the manifold 
from the sample local covariance matrix C^. We do this by looking at the 
principal components of C*^^, and so our estimators are very closely related 
to PCR. Suppose that we do an eigenvalue decomposition of C"^: 

(4.2) Cf = [UR [/^]diag(A\...,AP)[C/« f/^]', 

where £ RP""^, £ rpxCp-'^) and > > • • • > A^. Note that the 
eigenvalue decomposition always exists because C^^ is symmetric. The esti- 
mate of the manifold is given by the d most relevant principal components, 
and the remaining principal components are perpendicular to the estimated 
manifold. 

Because we want the projection matrix 11 to project /3 onto the directions 
perpendicular to the estimated manifold, we define the following projection 
matrices 

(4.3) 

P = diag(0,n). 

The choice of d is a tunable parameter that is similar to the choice in PCR. 
These matrices act as a regularizer because d can always be chosen to en- 
sure that rank(C'^^ + An„) = p. Furthermore, we have the following theorem 
regarding the full regularizing matrix P: 

Theorem 4.1 [Lemma A. 2, part (d)]. Under the assumptions given in 
Section 2, the following holds with probability one: 

(4.4) rank(C'„ + A„P„/n/i'^+2) =p + l. 

Our estimators can perform better than PCR because of a subtle differ- 
ence. PCR requires that the estimate lies on exactly the first d most relevant 
principal components; however, our estimator only penalizes for deviation 
from the d most relevant principal components. This is advantageous because 
in practice d is not known exactly and because the principal components 
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used are estimates of the true principal components. Thus, our regulariza- 
tion is more robust to errors in the estimates of the principal components. 
Also, our new regularization allows us to easily add additional lasso-type 
regularization to potentially improve the estimation. PGR cannot be easily 
extended to have lasso-type regularization. 

We denote the function value at Xq as f\xo- Also, we denote the exterior 
derivative of /(•) at Xq as d^/Uo- Then, the true regression coefficients are 
denoted by the vector 

(4.5) /3' = [/Uo 4/Uo]- 

The nonparametric exterior derivative estimator (NEDE) is given by 

(4.6) /3 = argmin{/iP||W^V2(y - X,J)\\l + XJP^ ■ ^g}, 

where P„ is defined using (4.3) with Cn- We can also define a nonparametric 
adaptive lasso exterior derivative estimator (NALEDE) as 

(4.7) /3 = argminl h^wH^Y - XxM\l + UWn ■ P\\l + l^nY,^\H I 

where P„ is define in (4.3) using Cn, w is the solution to (4.6), and 7 > 0. 

Our estimators have nice statistical properties, as the following theorem 
shows. 

Theorem 4.2. // the assumptions in Section 2 hold, then the NEDE 
(4-6) and NALEDE estimators are consistent and asymptotically nor- 

mal: 

H^/\$-^)AcW{B',a^V), 

where B and V are, respectively, given in (A. 3) and (A. 4), and Ct de- 
notes the Moore-Penrose psuedoinverse of C . Furthermore, we asymptoti- 
cally have that /3' G M x T^M. The NALEDE (4.7) estimator has the addi- 
tional feature that 

P(sign(/3) = sign(/3)) 1. 

Note that the asymptotic normality is biased because of the bias typical 
in nonparametric regression. This bias is seen in both the NEDE (4.6) and 
NALEDE (4.7) estimators, but examining B one sees that the bias only 
exists for the function estimate fx^ and not for the exterior derivative esti- 
mate dxf\xo- This bias occurs because we are choosing h to converge at the 
optimal rate. If we were to choose /i at a faster rate, then there would be no 
asymptotic bias, but the estimates would converge at a slower rate. 
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It is worth noting that the rate of convergence in Theorem 4.2 has an 
exponential dependence on the dimension of the manifold d, and our par- 
ticular rates are due to the assumption of the existence of three derivatives. 
As is common with local regression, it is possible to improve the rate of 
convergence by using local polynomial regression which assumes the exis- 
tence of higher-order derivatives [8, 46]. However, the general form of local 
polynomial regression on manifolds would require the choice of a particular 
chart (/)(•) and domain U. Local linear regression on manifolds is unique in 
that one does not have to pick a chart and domain. 

As a last note, recall that the rate of convergence in Theorem 4.2 depends 
on the dimension of the manifold d and does not depend on the dimension p 
of the ambient space. We might mistakenly think that this means that the 
estimator converges in the "large p, small n" scenario, but without additional 
assumptions these results are only valid for when p grows more slowly than 
n. Analogous to other "large p, small n" settings, if we assume sparsity then 
we can achieve faster rates of convergence, which is the subject of the next 
section. 

5. Large p, small n. We consider extensions of our estimators to the 
"large p, small n" setting. The key difference in this case is the need to 
regularize the covariance matrix. Our NEDE (4.6) and NALEDE (4.7) esti- 
mators use the eigenvectors of the sample covariance matrix, and it is known 
[7, 33] that the sample covariance matrix is poorly behaved in the "large p, 
small n" setting. To ensure the sample eigenvectors are consistent estimates, 
we must use some form of covariance regularization [7, 33, 61]. 

We use the regularization technique used in [7] for ease of analysis and 
because other regularization techniques [33, 61] do not work when the true 
covariance matrix is singular. The scheme in [7] works by thresholding the 
covariance matrix, which leads to consistent estimates as long as the thresh- 
old is correctly chosen. We define the thresholding operator as 

Tt{m) = ml(|m| > t), 

and by abuse of notation Tt{M) is Tt{-) applied to each element of M . 

The setup and assumptions are nearly identical to those of the fixed p case 
described in Section 2. The primary differences are that (a) d,p,n increase 
at different rates toward infinity, and (b) there is some amount of sparsity in 
the manifold and in the function. The population parameters C„, analogous 
to (3.2), are functions of n and are defined in nearly the same manner, 
except with [C^^]a; = F{0)/2 J^a K[du(t) ' u)dij4>^u^u^ du. Their estimates are 
now defined 



Cn = H-^X'W,,X, 
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compare this to (3.1). Just as C„ can be interpreted as a local covariance 
matrix, we now define a local cross-covariance matrix: 



Rn 




■ /U'O 


.Rn. 




.^"n^ ' /Uo + ■ dxf\xo_ 


fiven 


by 





For the sake of brevity, we summarize the other the differences from Section 
2. The following things are different: 

1. The manifold 7W„,, local chart (^n(')) manifold dimension number of 
predictors p„, and density function are all functions of n. We drop 
the subscript n when it is clear from the context. These objects are defined 
in the same manner as in Section 2, and we additionally assume that the 
density F[-) is Lipschitz continuous. 

2. The asymptotic rates for d,p,n are given hy d = o(logn), 

/i = o((c^n/logp)-V{4+'^)), 



logp .... 

^=°(')' 

where Cn is a measure of sparsity that describes the number of nonzero 
entries in covariance matrices, exterior derivative, etc. 

3. The kernel K(-) has finite support and is Lipschitz continuous, which 
implies that 

for u ^ ^oQ' Contrast this to the second assumption in Section 2. 

4. The local chart 0n("); function /«(•) and local (cross-)covariance matrices 
Cn,Rn satisfy the following sparsity conditions: 

p 

(5.1) ^l(g^YO)<Cn and |Q^'|<M, 

k=l 

for (derivatives of the local chart; the index k denotes the kth. component 

%j 4^ 1 dijm 4^ 1 dijvan 



of the vector-valued (j)) = di4>^ , dij4>^ , dijm4>'' ^ dijmn'P^ for (derivatives 



of the function) = [dxf]k,dik{f ° 4>)jdijk{f o </>); and for (local covari- 
ance matrices) Q'' = [Cn]ik, [Rn]ik- 
5. The smallest, nonzero singular value of the local covariance matrix is 
bounded. That is, there exists e > such that 

(5.2) inf ( inf a(Cn)) > e. 
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This condition ensures that the regularized inverse of the local covariance 
matrix is well defined in the limit; otherwise we could have a situation 
with ever-decreasing nonzero singular values. 
6. The Tikhonov-type regularization parameter A„ and the lasso-type reg- 
ularization parameter fin have the following asymptotic rates: 

An = Op ( ^/(^( 

f^ncj =0(1), 



l/4x 1-7 



^3/2 (]^\ 

7. The threshold which regularizes the local sample covariance matrix is 
given by 



(5.3) tn=K.m, 

V n 

where = o(l). This regularization will make the regression estimator 
consistent in the "large p, small n" case. 



5.1. Manifold regularization. The idea is to regularize the local sample 
covariance matrix by thresholding. If the true, local covariance matrix is 
sparse, this regularization will give consistent estimates. This is formalized 
by the following theorem. 

Theorem 5.1. If the assumptions given in Section 5 are satisfied, then 

WTtiCn) -Cn\\= OpiCn^Jlogp/nh^), 
\\Tt{Rn) -Rn\\= Op{CnAjlogp/nhd). 

Once we have consistent estimates of the true, local covariance matrix, we 
can simply apply our manifold regularization scheme described in Section 4. 
The nonparametric exterior derivative estimator for the "large p, small n" 
case (NEDEP) is given by 

(5.4) = argmin||(rt(C„) + A„Pn)/3 - Tt{Rn)\\l 

where P„ is as defined in (4.3) except using Tt{C'^). The nonparametric 
adaptive lasso exterior derivative estimator for the "large p, small n" case 
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(NALEDEP) is given by 

P = argminargmin||(rt(C'„) + A„P„);S - Tt{Rn)\\l 

(5^5) 

j=l j 

where Pn is as defined in (4.3) except using Tt{C^^) and w is the solution 
to (5.4). These estimators have nice statistical properties. 

Theorem 5.2. If the assumptions given in Section 5 are satisfied, then 
the NALEDE (5.4) and NALEDEP (5.5) estimators are consistent: 

The NALEDEP (5.5) estimator is also sign consistent: 

P(sign(/3) = sign(/3)) 1. 



We do not give a proof of this theorem, because it uses essentially the 
same argument as Theorem 5.3. One minor difference is that the proof uses 
our Theorem 5.1 instead of Theorem 1 from [7]. 

5.2. Linear case. Our estimators admit simple extensions in the spe- 
cial case where predictors lie on a global, linear manifold and the response 
variable is a linear function of the predictors. We specifically consider the 
errors-in- variables situation with manifold structure in order to present our 
formal results, because: in principle, our estimators provide no improvements 
in the linear manifold case over existing methods when there are no errors- 
in-variables. In practice, our estimators sometimes provide an improvement 
in this case. Furthermore, our estimators provide another solution to the 
identifiability problem [19]; the exterior derivative is the unique set of re- 
gression coefficients because the predictors are only sampled in directions 
parallel to the manifold, and there is no derivative information about the 
response variable in directions perpendicular to the manifold. 

Suppose there are n data points and p predictors, and the dimension of 
the global, linear manifold is d. We assume that d,n,p increase to infinity, 
and leaving d fixed is a special case of our results. We consider a linear model 
rj = H/?, where 77 S M"^^ is a vector of function values, H G M"^^ is a matrix 
of predictors, and /3 S is a vector. 

The S are distributed according to the covariance matrix S^, which is 
also a singular design matrix in this case. The exterior derivative of this 



14 



A. ASWANI, P. BICKEL AND C. TOMLIN 



linear function is given by /3 = PT,^f3, where Pj]^ is the projection matrix 
that projects onto the range space of T,^. We make noisy measurements of 
T] and H: 

X = E + i^, 
Y = ri + e. 

The noise v and e are independent of each other, and each component of 
V is independent and identically distributed with mean and variance cr^. 
Similarly, each component of e is independent and identically distributed 
with mean and variance <t^. In this setup, the variance cr^ is identifiable 
[28, 31], and an alternative that works well in practice for low noise situations 
is to set this quantity to zero. 

Our setup with errors-in- variables is different from the setup of existing 
tools [10, 38], but it is important because in practice, many of the near- 
collinearities might be true collinearities that have been perturbed by noise. 
Several formulations explicitly introduce noise into the model [11, 14, 28, 
31, 41]. We choose the setting of [28, 31], because the noise in the predictors 
is identifiable in this situation. 

The exterior derivative estimator for the "large p, small n" case (EDEP) 
is given by 

(5.6) /3 = argmm\\{Tt{X'X/n) - all + A„P„)/3 - Tt{X'Y/n)\\l 

where P„ is as defined in (4.3) except applied to C'^ = Tt{X' X / n) - all. This 
is essentially the NEDEP estimator, except the weighting matrix is taken to 
be the identity matrix and there are additional terms to deal with errors- 
in- variables. We can also define an adaptive lasso version of our estimator. 
The adaptive lasso exterior derivative estimator for the "large p, small n" 
case (ALEDEP) is given by 

/3 = ^Tgmm\\{Tt{X'X/n) - all + XnPn)P - Tt{X'Y/n)\\l 

where Pn is as defined in (4.3) except applied to C'^ = Tt{X'X) — all and w 
is the solution to (5.6). We can analogously define the EDE and ALEDE es- 
timators which are the EDEP and ALEDEP estimators without any thresh- 
olding. In practice, setting the value of the al term equal to zero seems to 
work well with actual data sets. 

The technical conditions we make are essentially the same as those for the 
case of the local, nonlinear manifold. The primary difference is that we ask 
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that the conditions in Section 5 hold globally, instead of locally. This also 
means that we do not use any kernels to localize the estimators, and the W 
matrix in the estimators is simply the identity matrix. If the theoretical rates 
for the regularization and threshold parameters are compatibility redefined, 
then we can show that these estimators have nice statistical properties. 

Theorem 5.3. If the assumptions in Sections 5 and 5.2 hold, then the 
EDEP (5.6) and ALEDEP (5.7) estimators are consistent. They asymptot- 
ically converge at the following rate: 



Our theoretical rate of convergence is slower than that of other techniques 
[10, 38] because we have applied our technique for local estimation to global 
estimation, and we have not fully exploited the setup of the global case. 
However, we do get faster rates of convergence in our global case versus 
our local case. Furthermore, our model has errors- in- variables, while the 
model used in other techniques [10, 38] assumes that the predictors are 
measured with no noise. Applying the various techniques to both real and 
simulated data shows that our estimators perform comparably to or better 
than existing techniques. It is not clear if the rates of convergence for the 
existing techniques [10, 38] would be slower if there were errors-in-variables, 
and this would require additional analysis. 

6. Estimation with data. Applying our estimators in practice requires 
careful usage. The NEDE estimator requires the choice of two tuning param- 
eters, while the NALEDE and NEDEP estimators require choosing three; the 
NALEDEP estimator requires choosing four. The extra tuning parameters — 
in comparison to existing techniques like MP or RR — make our method 
prone to over-fitting. This makes it crucial to select the parameters using 
methods, such as cross-validation or bootstrapping, that protect against 
overfitting. It is also important to select from a small number of different 
parameter-values to protect against overfitting caused by issues related to 
multiple-hypothesis testing [42, 44, 45]. 

Bootstrapping is one good choice for parameter selection with our estima- 
tors [7, 9, 48-50]. Additionally, we suggest selecting parameters in a sequen- 
tial manner; this is to reduce overfitting caused by testing too many models 
[42, 44, 45]. Another benefit of this approach is that it simplifies the pa- 
rameter selection into a set of one-dimensional parameter searches — greatly 




The ALEDE (5.7) estimator is also sign consistent: 
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reducing the computational complexity of our estimators. For instance, we 
first select the Tikhonov-regularization parameter A for RR. Using the same 
A value, we pick the dimension d for the NEDE estimator. The prior values 
of A and d are used to pick the lasso-regularization parameter n for the 
NALEDE estimator. 

MATLAB implementations of both related estimators and our estimators 
can be found online.^ The lasso-type regressions were computed using the 
coordinate descent algorithm [17, 57], and we used the "improved kernel PLS 
by Dayal" code given in [2] to do the PLS regression. The increased com- 
putational burden of our estimators, as compared to existing estimators, is 
reasonable because of: improved estimation in some cases, easy paralleliza- 
tion, and computational times of a few seconds to several minutes on a 
general desktop for moderate values of p. 

7. Numerical examples. We provide three numerical examples: the first 
two examples use simulated data, and the third example uses real data. 
In the examples with simulated data, we study the estimation accuracy of 
various estimators as the amount of noise and number of data points vary. 
The third example uses the Isomap face data^ used in [53]. In the example, 
we do a regression to estimate the pose of a face based on an image. 

For examples involving linear manifolds and functions, we compare our 
estimators with popular methods. The exterior derivative is locally defined, 
but in the linear case it is identical at every point — allowing us to do the 
regression globally. This is in contrast to the example with a nonlinear man- 
ifold and function where we pick a point to do the regression at. Though 
MP, PLS, PGR, RR and EN are typically thought of as global methods, we 
can use these estimators to do local, nonparametric estimation by posing 
the problem as a weighted, linear regression which can then be solved using 
either our or existing estimators. As a note, the MP and OLS estimators are 
equivalent in the examples we consider. 

Some of the examples involve errors-in-variables, and this suggests that 
we should use an estimator that explicitly takes this structure into account. 
We compared these methods with Total Least Squares (TLS) [27] which 
does exactly this. TLS performed poorly with both the simulated data and 
experimental data, and this is expected because standard TLS is known to 
perform poorly in the presence of collinearities [27]. TLS performed compa- 
rably to or worse than OLS/MP, and so the results are not included. 

Based on the numerical examples, it seems that the improvement in esti- 
mation error of our estimators is mainly due to the Tikhonov-type regulariza- 
tion, with lasso-type regularization providing additional benefit. Threshold- 
ing the covariance matrices did not make significant improvements, partly 



''http : / /hybrid. eecs .berkeley . edu/~NEDE/EDE_Code .zip. 
''http : / / isomap. stanf ord. edu/datasets .html . 
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because bootstrap has difficulty witli picking tlie thresliolding parameter. 
Improvements may be possible by refining the parameter selection method 
or by changes to the estimator. We also observed the well-known tendency 
of lasso to overestimate the number of nonzero coefficients [39]; using sta- 
bility selection [39] to select the lasso parameter would likely lead to better 
results. 

7.1. Simulated data. We simulate data for two different models and use 
this to compare different estimators. One model is linear, and we do global 
estimation in this case. The other model is nonlinear, and hence we do 
local estimation in this case. In both models, there are p predictors and the 
dimension of the manifold is d = round(|p). The predictors ^ and response 
rj are measured with noise: 

x = e+AA(o,(j2), 

2/ = ??+AA(0,cj2). 

And for notational convenience, let q = round(^p). Define the matrix 




0.3l*-^l, iil<i,j<d, 

0.3, ifd+l<i<pAj = q + i-d, 

0.3, ifd+l<i<pAj = q + i + l-d, 

0, o.w. 



The two models are given by: 

1. Linear model: the predictors are distributed ^ = Af{0, FF'), and the func- 
tion is 

(7.1) r? = /(0 = l+ j2 

i=l 
i is odd 

If u) = [1 1 • • •] is a vector with ones in the odd index-positions and 
zeros elsewhere, then the exterior derivative of this linear function at 
every point on the manifold is given by the projection of w onto the 
range space of the matrix F. 

2. Nonlinear model: the predictors are distributed ^ = sin(AA(0, FF')), and 
the function is 

(7.2) r? = /(e) = l+ 

i=l 
i is odd 

We are interested in local regression about the point xq = ^ ■ ■ ■ 0] . If 
w = [1 1 • • •] is a vector with ones in the odd index-positions and zeros 
elsewhere, then the exterior derivative of this nonlinear function at the 
origin is given by the projection of w onto the range space of the matrix F. 
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Table 1 shows the average square- loss estimation error ||/3 — /3||| for differ- 
ent estimators using data generated by the linear model and nonlinear model 
given above, over different noise variances and number of data points n. We 
conducted 100 replications of generating data and doing a regression, and 
this helped to provide standard deviations of square-loss estimation error 
to show the variability of the estimators. Table 2 shows computation times 
in seconds for the different estimators; it shows that our estimators require 
more computation, but the computation time is still reasonable. The com- 
putation time does not depend on the noise level, and so we have averaged 
the computation times over the 100 + 100 replications for = 0.01,0.10. 

One curious phenomenon observed is that the estimation error goes down 
in some cases as the error variance of the predictors cj^ increases. To un- 
derstand why, consider the sample covariance matrix in the linear case 
S = X'X/n with population parameter S = FF' + u^I. Heuristically, the 
OLS estimate will tend to {FF' + a'^T)~^X'Y/n, and the error in the pre- 
dictors actually acts as the Tikhonov-type regularization found in RR, with 
lower levels of noise leading to less regularization. 

Table 1 

Averages and standard deviations over 100 replications of square-loss estimation error 
for different estimators using data generated by the linear model and nonlinear model 
given in Section 7.1, over different noise variances and number of data points n 





n 


= 10 


n = 


100 


n = 


1000 






Linear model: 


= 0.01,(7^ 


= 1.00 




OLS/MP 


3.741 


(2.476) 


6.744 


(3.602) 


0.588 


(0.368) 


RR 


2.523 


(1.020) 


0.369 


(0.193) 


0.167 


(0.086) 


EN 


2.562 


(1.054) 


0.117 


(0.197) 


0.017 


(0.008) 


PLS 


2.428 


(0.536) 


0.501 


(0.207) 


0.031 


(0.013) 


PGR 


3.391 


(0.793) 


1.629 


(0.144) 


1.583 


(0.047) 


EDE 


2.528 


(1.030) 


0.367 


(0.185) 


0.166 


(0.086) 


ALEDE 


2.564 


(1.061) 


0.111 


(0.177) 


0.015 


(0.006) 


EDEP 


2.527 


(1.025) 


0.370 


(0.184) 


0.167 


(0.085) 


ALEDEP 


2.563 


(1.057) 


0.111 


(0.177) 


0.015 


(0.006) 






Linear model: cr^ 


= 0.10, cr^ 


= 1.00 




OLS/MP 


3.629 


(1.488) 


1.259 


(0.598) 


0.173 


(0.050) 


RR 


2.717 


(0.847) 


0.892 


(0.351) 


0.260 


(0.042) 


EN 


2.783 


(0.895) 


0.661 


(0.523) 


0.064 


(0.013) 


PLS 


2.588 


(0.566) 


0.740 


(0.252) 


0.138 


(0.037) 


PGR 


3.425 


(0.747) 


1.645 


(0.144) 


1.569 


(0.047) 


EDE 


2.716 


(0.846) 


0.840 


(0.305) 


0.255 


(0.042) 


ALEDE 


2.782 


(0.893) 


0.590 


(0.459) 


0.050 


(0.013) 


EDEP 


2.720 


(0.849) 


0.841 


(0.305) 


0.255 


(0.042) 


ALEDEP 


2.784 


(0.895) 


0.590 


(0.459) 


0.050 


(0.013) 
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Table 1 
( Continued) 





n 


= 20 


n = 


100 


n = 


1000 






Nonlinear model: cr^ = 


= 0.01, cr^ = 


= 1.00 




OLS/MP 


657.7 


(1842) 


3.797 


(2.172) 


0.367 


(0.223) 


RR 


2.188 


(0.715) 


1.160 


(0.265) 


0.300 


(0.132) 


EN 


2.140 


(0.768) 


1.083 


(0.341) 


0.162 


(0.081) 


PLS 


2.102 


(0.757) 


0.995 


(0.256) 


0.172 


(0.120) 


PGR 


2.731 


(0.439) 


1.679 


(0.244) 


0.123 


(0.043) 


NEDE 


2.184 


(0.716) 


1.104 


(0.269) 


0.288 


(0.118) 


NALEDE 


2.136 


(0.769) 


1.009 


(0.357) 


0.144 


(0.067) 


NEDEP 


2.184 


(0.716) 


1.103 


(0.269) 


0.288 


(0.118) 


NALEDEP 


2.136 


(0.769) 


1.008 


(0.357) 


0.144 


(0.067) 






Nonlinear 


model: 


= 0.1,^2 = 


1.00 




OLS/MP 


147.5 


(338.3) 


1.843 


(0.975) 


0.473 


(0.110) 


RR 


2.693 


(1.793) 


1.260 


(0.369) 


0.672 


(0.139) 


EN 


2.698 


(1.927) 


1.168 


(0.455) 


0.472 


(0.197) 


PLS 


2.385 


(0.629) 


1.210 


(0.318) 


0.768 


(0.133) 


PGR 


2.767 


(0.450) 


1.766 


(0.293) 


0.554 


(0.348) 


NEDE 


2.694 


(1.793) 


1.233 


(0.352) 


0.641 


(0.119) 


NALEDE 


2.702 


(1.925) 


1.126 


(0.445) 


0.407 


(0.130) 


NEDEP 


2.693 


(1.794) 


1.231 


(0.352) 


0.641 


(0.119) 


NALEDEP 


2.702 


(1.926) 


1.124 


(0.445) 


0.407 


(0.130) 



The results indicate that our estimators are not significantly more vari- 
able than existing ones, and our estimators perform competitively against 
existing estimators. Though our estimators are closely related to PGR, RR 
and EN, our estimators performed comparably to or better than these es- 
timators. PLS also did quite well, and our estimators did better than PLS 
in some cases. Increasing the noise in the predictors did not seem to signifi- 
cantly affect the qualitative performance of the estimators, except for OLS 
as explained above. 

In Section 5.2, we discussed how the converge rate of our linear estimators 
is of order n~^/^ which is in contrast to the typical convergence rate of n~^/^ 
for lasso-type regression [38]. We believe that this theoretical discrepancy is 
because our model has errors-in-variables while the standard model used in 
lasso- type regression does not [38] . These theoretical differences do not seem 
significant in practice. As seen in Table 1, our estimators can be competitive 
with existing lasso-type regression. 
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Table 2 

Averages and standard deviations over 200 replications of computation times m seconds 
for the different estimators using data generated by the linear model and 
nonlinear model given in Section 7.1 





n 


10 


n 


= 100 


n 


= 1000 








Linear model 






OLS/MP 




(0.000) 


0.001 


(0.000) 


0.001 


(0.000) 


RR 


0.055 


(0.005) 


0.063 


(0.005) 


0.110 


(0.004) 


EN 


L739 


(1.325) 


0.292 


(0.032) 


0.387 


(0.042) 


PLS 


L631 


(0.029) 


1.716 


(0.077) 


1.925 


(0.051) 


PGR 


0.185 


(0.006) 


0.192 


(0.010) 


0.253 


(0.007) 


EDE 


0.239 


(0.007) 


0.255 


(0.016) 


0.367 


(0.021) 


ALEDE 


2.071 


(1.363) 


0.715 


(0.043) 


0.912 


(0.059) 


EDEP 


0.411 


(0.011) 


0.523 


(0.019) 


0.731 


(0.052) 


ALEDEP 


2.248 


(1.366) 


0.990 


(0.042) 


1.283 


(0.083) 




n 


20 


n 


= 100 


n 


= 1000 








Nonlinear model 






OLS/MP 


• 


(0.000) 


0.001 


(0.000) 


0.001 


(0.000) 


RR 


0.359 


(0.039) 


0.470 


(0.028) 


0.938 


(0.040) 


EN 


0.916 


(0.433) 


1.005 


(0.058) 


1.917 


(0.269) 


PLS 


1.980 


(0.130) 


2.114 


(0.072) 


2.879 


(0.162) 


PGR 


1.014 


(0.062) 


1.118 


(0.038) 


1.911 


(0.166) 


NEDE 


0.840 


(0.078) 


1.066 


(0.058) 


2.031 


(0.092) 


NALEDE 


2.175 


(0.594) 


1.843 


(0.090) 


3.381 


(0.401) 


NEDEP 


1.403 


(0.120) 


1.726 


(0.087) 


2.277 


(0.175) 


NALEDEP 


2.671 


(0.647) 


2.513 


(0.122) 


4.631 


(0.397) 



7.2. Isomap face data. The Isomap face data^ from [53] consists of im- 
ages of an artificial face. The images are labeled with and vary depend- 
ing upon three variables: illumination-, horizontal- and vertical-orientation; 
sample images taken from this data set can be seen in Figure 2. Three- 
dimensional images of the face would form a three-dimensional manifold 
(each dimension corresponding to a variable), but this data set consists of 
two-dimensional projections of the face. Intuitively, a limited number of 
additional variables are needed for different views of the face (e.g., front, 
profile, etc.). This intuition is confirmed by dimensionality estimators which 
estimate that the two-dimensional images form a low-dimensional manifold 
[35]. 

To compare the different estimators, we did 100 replications of the fol- 
lowing experiment: we randomly split the data (n = 698 data points) into 



http : / / isomap . stanf ord. edu/datasets .html . 
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□ ^ a y 



Fig. 2. Sample images from the Isomap face data [53]. The images are labeled with and 
vary depending upon three variables: illumination-, horizontal- and vertical-orientation. 

Table 3 

Averages and standard deviations over 100 replications of validation set prediction 
error for different estimators using the Isomap face data [53]. The average computation 
time and its standard deviation is given in seconds, and it gives the time to estimate 
the horizontal pose angle of a single image 





Prediction 


error 


Computation time 


OLS/MP 


5.276 


(11.06) 


0.002 


(0.001) 


RR 


5.286 


(6.156) 


2.052 


(0.166) 


EN 


5.168 


(6.112) 


17.79 


(5.235) 


PLS 


10.12 


(14.84) 


16.22 


(0.841) 


PGR 


5.813 


(7.617) 


8.877 


(0.609) 


NEDE 


4.523 


(4.926) 


4.740 


(0.347) 


NALEDE 


4.409 


(4.889) 


22.94 


(5.349) 


NEDEP 


4.527 


(4.925) 


7.777 


(0.511) 


NALEDEP 


4.406 


(4.900) 


25.77 


(5.221) 



a training set rit = 688 and validation set n„ = 10, and then we used the 
training set to estimate the horizontal pose angle of images in the validation 
set. Since we are doing local linear estimation, the estimate for each image 
requires its own regression. The number of predictors p is large in this case 
because each data point Xi is a two-dimensional image. Estimation when p 
is large is computationally slow, and so we chose a small validation set size 
to ensure that the experiments completed in a reasonable amount of time. 
Replicating this experiment 100 times helps to prevent spurious results due 
to having a small validation set. 

To speed up the computations further, we scaled the images from 64 x 64 
pixels to 7 X 7 pixels in size. This is a justifiable approach because the images 
form a low-dimensional manifold, and so this resizing of the images does 
not lead to a loss in predictor information [56]. This leads to significantly 
faster computations, because this process reduces the number of predictors 
from p = 4096 to p = 49. In practice, our choice of p = 49 gives predictions 
that deviate from the true horizontal pose angle of images (which uniformly 
ranges between —75 to 75 degrees) in the validation set by a root-mean- 
squared error of three degrees or less. 
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Table 3 gives the prediction error of the models generated by different 
estimators on the validation set. The specific quantity provided is 



where V is the set of predictors in the validation set, (3^{Xi) is the first 
component of the estimated regression coefficients computed about the point 
Xq = Xi , and Yi is the corresponding horizontal pose angle of the image Xi . 
The regression is computed using only data taken from the training set. 
The results from this real data set shows that our estimators can provide 
improvements over existing tools, because our estimators have the lowest 
prediction errors. Table 3 also provides the computation times for estimating 
the horizontal pose, and it again shows that our estimators require more 
computation but not by an excessively larger amount. 

8. Conclusion. By interpreting collinearity as predictors on a lower-dimen- 
sional manifold, we have developed a new regularization, which has connec- 
tions to PGR and RR, for linear regression and local linear regression. This 
viewpoint also allows us interpret the regression coefficients as estimates of 
the exterior derivative. We proved the consistency of our estimators in both 
the classical case and the "large p, small n" case and this is useful from a 
theoretical standpoint. 

We provided numerical examples using simulated and real data which 
show that our estimators can provide improvements over existing estimators 
in estimation and prediction error. Though our estimators provide modest 
improvements over existing ones, these improvements are consistent over the 
different examples. Specifically, the Tikhonov-type and lasso-type regular- 
izations provided improvements, and the thresholding regularization did not 
provide major improvements. This is not to say that thresholding is not a 
good regularization, because as we showed: from a theoretical standpoint, 
thresholding does provide consistency in the "large p, small n" situation. 
This leaves open the possibility of future work on how to best select this 
thresholding parameter value. 

There is additional future work possible on extending our set of estima- 
tors. There is some benefit provided by shrinkage from the Tikhonov-type 
regularization which is independent of the manifold structure. Exploring 
more fully the relationship between manifold structure and shrinkage will 
likely lead to improved estimators. 



In this section, we provide the proofs of our theorems. We also give a few 
lemmas, which are needed for the proofs, that were not stated in the main 
text. 



(7.3) 
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Lemma A.l. If the assumptions in Section 2 hold, then: 

(a) ||E(C'22)-C22||2 = 0(/,4). 

(b) ||E[(Cf - C722)((722 - ||2 = Qil/uh'^); 

(C) Cn^C. 

Proof. This proof follows the techniques of [8, 46]. We first prove part 

(c) . Note that 



1=1 

and consider its expectation 

£((7^1) = E(^i^,(x, - xo)l{X E (Bl j^,-.))^ 

K{du(p ■ u)F{0) du + Oih"^), 

where we have used the assumption that K{-) is an even function, K'{-) is 
an odd function, and K"{-) is an even function. 
A similar calculation shows that for 

1 " 
1=1 

we have that the expectation is 

nci')=o{h)=o{i). 

And, a similar calculation shows that for 

1 " 

= f^d+2-p Y^h{xi - xo){xi - xo){xi - xo)', 



i=l 



we have that the expectation is 



nci')=F{Q)du^ 



K{du(j)(0) ■ u)uu' du 



du<P' + 0{h^). 



The result in part (c) follows from the weak law of large numbers. The last 
calculation also proves part (a). 
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Next, we prove part (b). For notational simplicity, let 
Ti = Kh{xi - xo)(xi - XQ){xi - xq)' . 

The variance is 

Var(C'f ) = ^2/,4+2d-2p Tr(n(E(r,7;0 -E(rOE(r,)0 

+ n(n - l)(E(T^,rj) - E(r,)E(r,)'))- 

Since Tj and Tj are independent, it follows that E(TjTj) — E(rj)E(rj)' = 0. 
Next, note that 



X [</.(z) -<^(0)][<A(z) -</.(0)]'F(z)dz + o(/i'^+2)\ 



= /i'^+4-2f F(0) / {K{du<P-u)fdu(l)-uu'-du^'-du<P 

X nn • du(l)' -du + Oih^) 
Thus, the variance is given by 

Var(C'f) = Tr f F(0) / {K{du(t) ■ u)f du^ ■ uu' ■ dj ■ d„(/> 



X uu' ■ du(t>' ■ du + Op{l)^ . |-| 



Lemma A. 2. // the assumptions in Section 2 hold, then the matrices 
On, C^^, n, tin, Pn, o,nd P havc the following properties: 

(a) rank(C22) = d and n{C^^) = TpM; 

(b) 7^(^)=AA(C22), AA(^)=7^(C22) andM{U)nM{C^^) = {0}; 

(c) \\Pn - Pg = \\Un - n||i = Opil/nh^); 

(d) P(rank(C„ + / nh'^+'^) =p+l)^l. 

Proof. To show property (a), we first show that for M € R'^^'^, where 
M = / K{du4>ifi) ■ u)uu du, 



we have that rank(M) = d. To prove this, choose any u G M'^ \ {0} and then 
consider the quantity 

v'Mv= / K{du4>{0) ■ u)v'uu'v du. 
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By construction, v'uu'v > almost everywhere. Additionally, since (p is three 
times differentiable, we have that K{du4>{0) • n) > on a set of nonzero 
measure and K{du(l){0)-u) > elsewhere. Thus, v'Mv > for aU t; G M'^\{0}. 
It follows that M is symmetric and positive definite with rank(M) = d. Since 
M is a d-dimensional manifold, we have that rank(du0) = d hy Corollary 
8.4 of [34]. The Sylvester Inequality [47] implies that 

rank(C22) = icank{du(j)Mdu(j)') = d, 

and this implies that 

n{C22) = n{du(t}Mdu(t>') = n{du(t)). 

However, Tl{du4') = TpM, where we take p = xq. This proves the result. 
We next consider property (b). We have that 

(Ji, . . . ,cjrf / and cr^+i = • • • = dp = 0, 

because rank(C22) = dhy property (a). Thus, the null-space of C22 is given 
by the column-span of Um\ however, the construction of P implies that the 
column-span of Un is the range-space of P. Ergo, 'Tl{P) =M{C22)- Note 
that the column-span of Ur belongs to the null-space of P, because each 
column in Ur is orthogonal — by property of the SVD — to each column in 
Un- Thus, we have the dual result that M{P) = 7^(6*22). The orthogonality 
of Ur and Un due to the SVD implies that M{P)r\M{C22) = {0}. 

Now, we turn to property (c). For h = Kn~^^^'^^^\ Lemma A.l says that 
jjC'^^ - C'22|||' = Op{l/nh'^). The result follows from Corollary 3 of [29], by 
the fact that I — Px is the projection matrix onto the null-space of X , and 
by the equivalence: 

\\Px - Pz\\l = \\sme[n{x),n{z)]\\, 

where Px is a projection matrix onto the range space of X [52] . 
Lastly, we deal with property (d). Lemma A.l shows that 



a, 4 C : 



Cii 

C722 



Since F{0) 7^ by assumption, Cu ^ 0; thus, rank(C) = 1 -|-rank(C22)- Since 
Af{P) nAf{C22) = {0}, we have that rank(C22 + \nP/nh'^+'^) =p. Conse- 
quently, rank(C) =p+l. Next, consider the expression 

\\Cn + A„P„M'^+2 -C- XnP/nh''+Y2 



<Op{h^) + 0. 
< Op{h). 



< \\Cn - CII2 + :^J^\\Pn - PWl 



n 
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Weyl's theorem [6] implies that 

(A.l) \\ai{Cn + A„P„M'^+2) - ai{C + KP/nh^+^)\\l < Op{h). 

Note that (Tj(C + \nP /nh'^^'^) is nondecreasing because Xn/nh'^'^^ is nonde- 
creasing. Define 

7? = min(cT,(C + A„P/n2/(^+4))), 
and consider the probabihty 

P(rank(C„ + XnPn/nh'^+^) =p + l) 

>F{\ai{Cn + XnPn/nh'^+^) 

(A.2) -ai{C + XnP/nh'^+^)\ < r],yi) 

P+i 

>J2^{\ai{Cn + XnPn/nh^+^) 
i=l 

-ai{C + XnP/nh''+^)\<T])-p. 
The result follows from (A.l) and (A.2). □ 

For notational convenience, we define 

i?„ = hP{f{x) - px,yw^,x,H-'/^ 

and 

M = ^[didj{fo(i))-dj-did,^]. 

Then, we have the following result concerning the asymptotic bias of the 
estimator: 

Lemma A. 3. If h = Kn~^/('^+^), then Bn A B, where 
(A.3) B= kF{0) / K{du(t>-u)uu' duM' Opxi 

Proof. First, recall the Taylor polynomial of the pullback of / to z: 

f{cl){z)) = fim) + d^f-du^-z + ^ d.djif o ,/.) . zz' + o{\\zf), 

where we have performed a pullback of dxf from T*M to T*U. In the 
following expression, we set z = hu: 

f{4>{z))-f{x^)-dj-[m-m\ 

= ^[^^^j{f o <t>) - dj ■ did,cp\uu' + o(h^^.h2. 



■h^Buu' + o{\\huf). 
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Because /3 = [/{xq) dxf]' , we can rewrite the expectation of the expression 



as 



E{Bn) = E ( hP-'^Knix - xo)(/(x) - x,„/3)'x 



xo 



K{du(t)-u)h^uuM' 



1 

l/hH 

1 



1 -ducj)- u + -didjcj)- uu' 



X (F(0) + hduF{0) ■u)}du 



F(0) / K{du^-u)uu'duM' + o{l) 0{y/h?) 



where the last hne follows because of the odd symmetries in the integrand. 
Since h = Kn~^/^'^^^\ this expectation becomes 

E(S„)=i? + o(l)li,<(p+i). 

The result follows from application of the weak law of large numbers. □ 



Let 
(A.4) 



y = F(o) / {K{du(t>-u)y 



1 

du4> ■ uu' ■ du4> 



du, 



then the following lemma describes the asymptotic distribution of the error 
residuals. 

Lemma A.4. If h = Kn~'^/^'^'^'^\ then 

Proof. Since E(e) = and e is independent of x, we have that 
E{^/^hPeKH{x - xo)xx,H~^/^) = 0. 
The variance of this quantity is 
\ai{hP./^eKH{x - xo)xx^H-^l^) 

= E{{hP^eKH{x - xo)x,,H~^/^)'{hPV^eKH{x - xq)x,,H~^/^)) 
= nh^Pa'^E{{KH{x-xa)f[l {x - x^)' ^ [I {x-xo)']) 

{K{d^<t^.u)f 1 {duc^-n + \d,dj^-uu^ 

du(l> ■ uu' ■ du4> 



a 
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X (F(0) + hduF{0) ■u)du + o{h^ 

= a^{V + o{h)I). 
Thus, the central hmit theorem impUes that 



n 



□ 



Proof of Theorem 4.2. This proof fohows the framework of [30, 59] 
but with significant modifications to deal with our estimator. For nota- 
tional convenience, we define the indices of /3 such that: /3o = /(a^o) and 
[/3i ••• /3p] = 4/. Let /3 = /3 + /i'-V2^ and 



+ An||P„-(/3 + F-V2 



u 



Let = argmin^„(n); then = /? + i^"^/^^^"). Note that *„(n) - 
^„(0) = T4(")(u), where 

+ 2{hP{Y - X^.pyW^.X,, + XnP'Pn)H-^/\. 

If Xn/nh'^^'^ —7- oo and hXn / nh'^^'^ — )■ 0, then for every u 

A„/3'P„F-^/\ = Xnl3'Pu/Vnh^0p{l) + Xnh/nh'^+'^Op{l), 

where we have used Lemma A. 2. It follows from the definition of j3 (4.5) and 
Lemma A.2 that p'P = 0; thus, Xn/3' PnH-^/^u = Xnh/nh'^+'^Op{l) = Op{l). 
For all u G M{P), we have 

Xn/nh'^+^u'PnU = Xn/nh'^+^Op{l/nh'^) = op{hXn/nh'^+^), 

and for all u ^ J\f{P), we have 

Xn/nh'^+'^u'PnU = Xn/nh'^+'^u'PnuOp{l) oo. 

Let W ^ M{0,a^V). Then, by Slutsky's theorem we must have that 
vi''\u)AVi{u) for every u, where 



u'Cu-2u'{W + B), ifnGAA(P), 
oo, otherwise. 



Lemma 5 shows that vj'^\u) is convex with high-probability, and Lemma 
5 also shows that V4{u) is convex. Consequently, the unique minimum of 
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V4{u) is given hy u = C^{W + B), where denotes the Moore-Penrose 
pseudoinverse of C. Fohowing the epi-convergence results of [20, 30], we have 

that -^C^W + B). This proves asymptotic normahty of the estimator, 
as well as convergence in probability. 

The proof for the NALEDE estimator comes for free. The proof formu- 
lation that we have used for the consistency of nonparametric regression 
in (4.6) allows us to trivially extend the proof of [59] to prove asymptotic 
normality and consistency. □ 

Lemma A. 5. Consider An, Bn €MP'^^'P" that are symmetric, invertible 
matrices. If \\An - Bnh = Op{-/n) , \\A~'^\\2 = Op{l) and \\B~^\\2 = Op{l), 
then \\A~^ - B~^\\2 = Op{-fn)- 

Proof. Consider the expression 

ll^n^ - -^n^lb = \\A~-^{Bn " An)B~^\\2 

^ Mn'^lb • ll^n — Bn\\2- ||-B„ """112, 

where the last line follows because the induced, matrix norm || • ||2 is sub- 
multiplicative for square matrices. □ 

Proof of Theorem 5.3. Under our set of assumptions, the results 
from [7] apply: 



(A.5) \\Tt{X'X/n) - {^i: + all)\\2 = oJ c, 



(A.6) \\Tt{X'Y/n) - J^^^h = Op 

An argument similar to that given in Lemma A. 2 implies that 

ll-Pn - Pn\\ = Op{Cny/logp/n). 

Consequently, it holds that 

\\Tt{X'X/n) - all + A„P„ - (S^ + A„P„)||2 

(A.7) 

= OpL{K + l)/^\ 

Next, observe that 

S^ + A„P„ = [[/r UN]diag{ai,. . . ,ad,Xn,- ■ ■ ,K)[Ur Un]' ■ 
Recall that we only consider the case in which d <p. We have that: 
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(a) + XnPny% = 0(1), because of (5.2); 

(b) - (Eg + A„P„)-i||2 = Op(l/A„). 

Weyl's theorem [6] and (A. 7) imply that 

\\{Tt{X'X/n) - all + XnPnr'h = Op{l). 
Additionally, Lemma A. 5 implies that 



\\{Tt{X'X/n) - all + A„P„)"' - (S^ + A„P„)-i|| = Op [ c„A„i 



n 



Note that the solution to the estimator defined in (5.6) is: 
/3 = {Tt{X'X/n) - all + A„P„)"'Tt(X'y/n). 

Next, we define 

/3(") ^ {Tt{X'X/n) - all + XnPnr%/3, 

and note that the projection matrix onto the range space of is given by 
Ps^ = S^S^. Thus, /3 = Pj^^P = S^S^/?. Consequently, we have that 

11/3 -/?||2< 11/3 -/3(")||2 + ||/3(")-/3||2 
(A.8) < \\{Tt{X'X/n) - all + XnPnr'h ' \\Tt{X'Y/n) - ^^Ph 

+ \\{TtiX'X/n) - all + XnPnr' - ' W^^Ph, 

where the inequality comes about because || • ||2 is an induced, matrix norm 
and the expressions are of the form ]RP^p(]RP^P]R^^). Recall that for symmet- 
ric matrices, = ; ergo, \\A\\2 < y^||^||i ||A||cxd = ||^||i- Because of 
(5.1), we can use this relationship on the norms to calculate that 115^^112 = 
0{cn) and ||/3|| = 0(c„). Consequently, 



(A.8) < Op { CnXn\r^ ) + Op{cl/Xn). 



The result follows from the relationship 

1/4 



Xn — Oi \fc^ 



logp 



n 

□ 



We can show that the bias of the terms of the nonparametric exterior 
derivative estimation goes to zero at a certain rate. 
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Lemma A. 6. Under the assumptions of Section 5, we have that 

mCn]ij)-[Cn]ij\=0{h^cl{2nf''), 
mRnh) - [Rnh\=0{h'cl{2^f''), 

where i,j denote the components of the matrices. Similarly, we have that 

Var([n(7„],j) = 0(l//i''), 
Yar {[nRn]^J) = {I /h'^). 

Proof. By the triangle inequality and a change of variables, 



Bias(C^^) 



13 d 

"o,n/h 



h 



< 



"o,n 



K 



(pjhu) - 0(0) 
h 



F{z)dz 



K{du(l) ■ u) 



K{du<t)-u)[F{hu)- F{f))]du 

The Taylor remainder theorem implies that 
^,>(/in)-0(O)^ 



K{du(t)-u)F{Q)du 
F{hu) du 
T1 + T2. 



h 

K{du4>-u) 

+ dkK{du(p ■ u) X {hdij(j)''\ou'u^/2 + dijm<p''\wu'u^u"' /6) 
+ dkiK{v)/2 X {hd,jq)''\ou'u^/2 + h^dijm<l)^\^u'u^u"'/6) 
X {hdijcp^lou'u^ /2 + h'^dijm(l)^\^u'u^u'^/6), 



where w e B^o and v e B'^ , i,a j,k\ i i /o , ;,2 o A.k\ i , m /«> and 

F{hu) = F(0) + hdiF\ou' + h"^ dijF\^u'u^ /2, 
where i; G (0, hu). 

The odd-symmetry components of the integrands of Ti and T2 will be 
equal to zero, and so we only need to consider even-symmetry terms of the 
integrands. Recall that K [■) , d^K [■) , d^iK {■) are, respectively, even, odd and 
even. By the sparsity assumptions, we have that 

Ti=0{h^d^cl{2^Y), 

T2 = 0{h'^d^{2nY). 

Consequently, Ti + = 0{h'^d^cl{2^Y) = 0{h^cl{2nf'^). 
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We can compute the variance of nC^^ to be 

Var(nCii) = / h^'^'iKmz) - m)/h)f{F{z)f dz 



"o,n/h 



= h-'' [ [Kmhu) - m)/h)f{F{hu)f dy 

-(E(nCii))2 
= 0(l//i'^). 

The remainder of the results follow by similar, lengthy calculations. Note 
that for the variance of terms involving YJ, a cj^ coefficient appears, but this 
is just a finite-scaling factor which is irrelevant in 0-notation. □ 

Proof of Theorem 5.1. The key to this proof is to provide an expo- 
nential concentration inequality for the terms in C„ and Rn- Having done 
this, we can then piggyback off of the proof in [7] to immediately get the 
result. The proofs for Cn and Rn are identical; so we only do the proof for 

Cn- 

Using the Bernstein inequality [36] and the union bound, 
¥(max\\[Cnhj -nCnhW > t 

2 f 

< 2p exp 

V 2 Var(n[C„]ij) + max(|n[C„]ij|)2t/3, 

Since the ith component of X obeys: \[X]i\ < M, it follows that 

max(|n[C'„],,•|)=2M//l^ 

where r] £ {0, 1, 2} depending on i and j. Using this bound and Lemma A. 6 
gives 



max|[C„]ij - E[Cn]ij| = OpU logp/n¥). 



Recall that 



ma.x\[Cn]ij - [Cn]ij\ < m.sx\[Cn]ij -E[Cn]ij\ +max|E([C„]jj) - [Cn]ij\. 

1,3 *J 



However, this second term is o{y^logp/nh^). Consequently, 



(A.9) max\[Cn]^j - [CnU = Op{J\ogp/nhd). 

Using (A.9), we can follow the proof of Theorem 1 in [7] to prove the result. 

□ 
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